Vector representation based on a supervised codebook for Nepali documents classification

Abstract

Document representation with outlier tokens degrades classification performance because of the uncertain orientation of such tokens. Most existing document representation methods, in Nepali as in other languages, ignore strategies to filter these tokens out of documents before learning their representations. In this article, we propose a novel document representation method based on a supervised codebook to represent Nepali documents, where our codebook contains only semantic tokens without outliers. Our codebook is domain-specific, as it is based on tokens in a given corpus that have higher similarities with the class labels in the corpus. Our method adopts a simple yet prominent representation for each word, called probability-based word embedding. To show the efficacy of our method, we evaluate its performance on the document classification task using a Support Vector Machine and validate it against widely used document representation methods such as Bag of Words, Latent Dirichlet allocation, Long Short-Term Memory, Word2Vec, Bidirectional Encoder Representations from Transformers and so on, using four Nepali text datasets (we denote them shortly as A1, A2, A3 and A4). The experimental results show that our method produces state-of-the-art classification performance (77.46% accuracy on A1, 67.53% accuracy on A2, 80.54% accuracy on A3 and 89.58% accuracy on A4) compared to the widely used existing document representation methods. It yields the best classification accuracy on three datasets (A1, A2 and A3) and a comparable accuracy on the fourth dataset (A4). Furthermore, we introduce the largest Nepali document dataset (A4), called the NepaliLinguistic dataset, to the linguistic community.
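The abstract does not spell out how the codebook or the probability-based word embedding is computed, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that a token's embedding is the estimated probability of each class given that token, that a token counts as an outlier when no class probability reaches a hypothetical `threshold`, and that a document is the average of its codebook-token embeddings. The function names, the threshold value, and the toy English tokens are all illustrative.

```python
from collections import Counter, defaultdict


def class_probabilities(docs, labels):
    """Estimate P(class | token) from per-class token counts (assumed definition)."""
    classes = sorted(set(labels))
    counts = defaultdict(Counter)  # token -> Counter mapping class -> count
    for tokens, label in zip(docs, labels):
        for tok in tokens:
            counts[tok][label] += 1
    embed = {}
    for tok, per_class in counts.items():
        total = sum(per_class.values())
        embed[tok] = [per_class[c] / total for c in classes]
    return classes, embed


def build_codebook(embed, threshold=0.6):
    """Keep tokens strongly tied to one class; treat the rest as outliers."""
    return {tok: vec for tok, vec in embed.items() if max(vec) >= threshold}


def represent(tokens, codebook, n_classes):
    """Average the embeddings of codebook tokens; zero vector if none match."""
    vecs = [codebook[t] for t in tokens if t in codebook]
    if not vecs:
        return [0.0] * n_classes
    return [sum(col) / len(vecs) for col in zip(*vecs)]


# Toy labelled corpus (hypothetical tokens standing in for Nepali text).
docs = [
    ["cricket", "match", "news"],
    ["election", "vote", "news"],
    ["cricket", "team"],
    ["vote", "party"],
]
labels = ["sports", "politics", "sports", "politics"]

classes, embed = class_probabilities(docs, labels)
codebook = build_codebook(embed, threshold=0.6)
doc_vector = represent(["cricket", "news", "vote"], codebook, len(classes))
```

In this toy corpus "news" appears equally in both classes, so it falls below the threshold and is left out of the codebook. The resulting fixed-length vectors could then be passed to any standard classifier, such as the Support Vector Machine named in the abstract.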

Similar resources

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...


Semi-Supervised Classification Based on Low Rank Representation

Graph-based semi-supervised classification uses a graph to capture the relationship between samples and exploits label propagation techniques on the graph to predict the labels of unlabeled samples. However, it is difficult to construct a graph that faithfully describes the relationship between high-dimensional samples. Recently, low-rank representation has been introduced to construct a graph,...


Text Classification Based On Manifold Semi-Supervised Support Vector Machine

This article presents a solution along with experimental results for an application of semi-supervised machine learning techniques and improvement on the SVM (Support Vector Machine) based on geodesic model to build text classification applications for Vietnamese language. The objective here is to improve the semi-supervised machine learning by replacing the kernel function of SVM using geodesi...


Efficient Vector Representation for Documents through Corruption

We present an efficient document representation learning framework, Document Vector through Corruption (Doc2VecC). Doc2VecC represents each document as a simple average of word embeddings. It ensures a representation generated as such captures the semantic meanings of the document during learning. A corruption model is included, which introduces a data-dependent regularization that favors infor...


An Improvement in Support Vector Machines Algorithm with Imperialism Competitive Algorithm for Text Documents Classification

Due to the exponential growth of electronic texts, their organization and management requires a tool to provide information and data in search of users in the shortest possible time. Thus, classification methods have become very important in recent years. In natural language processing and especially text processing, one of the most basic tasks is automatic text classification. Moreover, text ...



Journal

Journal title: PeerJ

Year: 2021

ISSN: 2167-8359

DOI: https://doi.org/10.7717/peerj-cs.412